[RFC][WIP] Common: Add an Initial Chat Memory Interface/Implementation #12698

markhpc · 2025-04-01T23:01:05Z

This is a rough proof-of-concept for implementing a chat memory interface inspired by ChatGPT's memories feature. It is separated into 3 parts:

An interface for main.cpp and server.cpp to interact with the chat memory.
A common class for various JSON and helper functions
A "simple" key/value store implementation based on unordered_map.

A key goal for this POC was to minimize the changes to main/server and keep as much of the logic in the chat-memory classes as possible. One specific change that was necessary for instance was to pass the conv_id from the webui back to the server so that each session has its own memory. Potentially per-user or per-group memories could be implemented as well. A future goal for this project would be to allow integration with local databases, S3, and Ceph to store these memories persistently.

The simple implementation has a lot of code dedicated to trying to keep the model from hallucinating about the state of the memory, with limited success. The model used for testing this is Gemma 3 4B Q8 and it aggressively wants to trust its own training to make up fake statistics. It's possible larger or other models may behave better, however this will need active work and may require specialized training to work consistently.

In addition to the above issue (among others!), this POC has several deficiencies:

The way the server prompt is being injected in main.cpp is bluntly overwriting any defaults and may require per-model injection rules in llama-chat.cpp.
Certain things like the memory size and debugging flag are hard-coded and should be made configuration options.
The chat-memory should be disabled by default and only enabled via a command-line flag.
The simple implementation does not yet have any TTL or other code to expire per-session memory once it's no longer valid.

My goal before taking this any further is to solicit feedback from ggml and the greater community to see if this project merits continued development. While the vast majority of the code is in ChatMemorySimple, the more important pieces to focus on IMHO is the interface, base class, and modifications to the existing code.

markhpc · 2025-04-02T00:19:09Z

Might help make the idea clearer...

markhpc · 2025-04-03T02:30:27Z

@ggerganov @ngxson If you would be willing, I'd like to hear any thoughts you have. I may dramatically change the backend memory implementation, but I want to make sure the way I'm interacting with main.cpp and server.cpp is reasonable.

Signed-off-by: Mark Nelson <[email protected]>

ggerganov · 2025-04-03T05:48:09Z

@markhpc I am not familiar with the "ChatGPT memories" feature and how it works. And briefly looking at the implementation I still don't know what it is (excuse me if it is something obvious).

But I would go out on a limb and say that most likely this is something we don't want to implement in the llama.cpp server due to complexity and questionable usefulness.

ngxson · 2025-04-03T06:47:40Z

Agree with @ggerganov . This feature is a cool UX but will be very difficult to maintain. I would categorize such features as "prompt engineering" and not actually an inference feature.

Indeed, before chatgpt even have the memory feature, I also implemented myself this feature in my own llama.cpp private fork using both prompts and llama_kv shifting API. It works for a while, but very tricky, doesn't work with all kinds of models.

I think in the future, with the addition of MCP in server web UI, this could be implemented in a more generic way. All cool things that people talk about like MCP, agent, tool calling, RAG are just prompt engineering anyway, just matter of how to organize the code.

markhpc · 2025-04-03T16:56:02Z

@ggerganov @ngxson Thank you both for your quick feedback!

FWIW, the goal here isn’t to replicate ChatGPT’s memory feature as a UX layer or only via prompting. My goal is to introduce an interface for directly interacting with inference at a deeper level. Right now that's to provide access to structured, namespaced data storage (key/value in this case). The demo here is just a std::map, but it could easily be sqlite3, S3, or Ceph. In the future I want to do more. I want to be able to eventually enable mid-stream behavioral constraint.

That's why I tried to keep the implementation (ChatMemorySimple) separated from the interface that enables it (which I should probably rename, it's really an inference hook). The long term goal is to support external governance scaffolding: tools for hallucination recovery, telos tracking, violation logging, and long-term reasoning...in addition to storing user memories.

I suspect that without these kinds of structures, persistent memory features will always be fragile unless reinforced through fine-tuning or runtime constraint. This is an attempt to prototype a real runtime cognition layer, not just simulate memory within the model’s weights. This is my first stab at trying to move some of this from model-level simulation into real code using real storage.

If this is something you guys think might be interested in, I would love figure out a lightweight way to tie into the inference loop. That's the key piece I believe I need as I'm not sure I can do everything completely externally.

ngxson · 2025-04-03T18:03:06Z

The long term goal is to support external governance scaffolding: tools for hallucination recovery, telos tracking, violation logging, and long-term reasoning...in addition to storing user memories.

Aren't these just prompt engineering?

ngxson · 2025-04-03T18:08:55Z

If this is something you guys think might be interested in, I would love figure out a lightweight way to tie into the inference loop. That's the key piece I believe I need as I'm not sure I can do everything completely externally.

Which missing API calls from llama.h blocking you from doing all of these in an external code?

markhpc · 2025-04-03T19:35:19Z

@ngxson Thank you!

My background is in storage and this is my first dive into the llama.cpp code, so I confess I'm still working to understand exactly what I need.

I believe it might look something like this though:

Pre-inference prompt hook: to inspect or inject constraint/memory (format_injection_prompt)
Post-inference output stream access: to intercept and optionally audit generation tokens or final messages (process_response)
Structured command routing: to allow structured JSON or role commands inside model output (parse_and_execute_command*, should be renamed).
Streaming inference monitoring: to track behavior chunk-by-chunk during generation.
Session lifecycle hook: to reset or checkpoint runtime state between inference cycles (not yet implemented).

Most of this is already there and being used in the ChatMemory interface in this PR. I reflect, it might be better to rename this to something like "InferenceHook". The core idea here is to see if this kind of inference-aware runtime behavior shaping could be an optional path forward. I 100% agree that it needs to be opt-in and lightweight though. My hope is that this could allow for a huge amount of flexibility for future developers.

markhpc · 2025-04-07T13:39:36Z

Update: In parallel I'm working on using the same interface for a governance model where I re-inject feedback into the next prompt based on the previous response. This works, but at least with Gemma3 doesn't consistently override undesirable behavior so I'm now working to learn how logit biases work and where I could potentially modify them. My current goal is to create per-session in-order tracking of tasks so I can then do things like compare responses, set up dynamic logit biases, look at drift, etc. I believe I can do this from within handle_completions_impl before inference begins, but would need to tie into update_slots. if I wanted to bias logits mid-stream. Alternately I could try to extend the code in sampling.cpp with my own sampler, but I'm a bit wary of that approach.

markhpc · 2025-04-19T04:05:14Z

Update: I've spent a fair amount of time trying to figure out ways to get the model I've been testing with (Gemma3 4B Q8) to regularly adopt the interface. It sporadically will execute key/value store commands, sometimes even searching in the store for instructions about how to use the commands (which I've tried placing there as a backup to the system prompt), but it is as likely to make up key/value pairs as it is to actually use the tool. In some cases, it will even return non existent keys as the result of a "list" command and return the result of the list command in the same response. I've also tried to implement mid-stream corrections as a test, but that likewise was a failure. Perhaps if I was testing with a higher parameter model it would do a better job.

@ngxson You were right, it was quite tricky in the end. I don't think that this is going to work the way I hoped it would unless someone has some idea of how to make tool usage more attractive to the model. I even thought about trying to change the probability distribution to favor the KV commands, but it all felt very invasive and brittle. I also ended up going down a bit of a labyrinth trying to find ways to enforce the rules via very elaborate prompt engineering (while learning how loaded some of the words I was using earlier are!), but the longer sessions go on, the worse the model gets rather than better (as you alluded to). Perhaps the other approaches you mentioned would work better.

markhpc · 2025-04-19T04:29:50Z

Closing this for now, since I don't see it having a high likelihood of success as-is. Will open a new PR if that changes.

markhpc requested a review from ngxson as a code owner April 1, 2025 23:01

github-actions bot added examples server labels Apr 1, 2025

markhpc force-pushed the wip-markhpc-chat_memory-poc branch from 29a9f5f to 236241c Compare April 2, 2025 01:40

ngxson removed their request for review April 2, 2025 07:47

Mark Nelson added 4 commits April 2, 2025 22:09

Chat memory interface and simple implementation.

bc75036

Signed-off-by: Mark Nelson <[email protected]>

examples/main: Add support for the chat memory.

58844d1

Signed-off-by: Mark Nelson <[email protected]>

examples/server: Pass the conv_id from the webui to the server.

67e94a0

Signed-off-by: Mark Nelson <[email protected]>

examples/server: Add support for the chat memory.

02d643b

Signed-off-by: Mark Nelson <[email protected]>

markhpc force-pushed the wip-markhpc-chat_memory-poc branch from 236241c to 02d643b Compare April 3, 2025 03:10

markhpc closed this Apr 19, 2025

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

[RFC][WIP] Common: Add an Initial Chat Memory Interface/Implementation #12698

[RFC][WIP] Common: Add an Initial Chat Memory Interface/Implementation #12698

markhpc commented Apr 1, 2025

markhpc commented Apr 2, 2025

markhpc commented Apr 3, 2025

ggerganov commented Apr 3, 2025 •

edited

Loading

ngxson commented Apr 3, 2025 •

edited

Loading

markhpc commented Apr 3, 2025

ngxson commented Apr 3, 2025

ngxson commented Apr 3, 2025

markhpc commented Apr 3, 2025 •

edited

Loading

markhpc commented Apr 7, 2025 •

edited

Loading

markhpc commented Apr 19, 2025 •

edited

Loading

markhpc commented Apr 19, 2025

[RFC][WIP] Common: Add an Initial Chat Memory Interface/Implementation #12698

[RFC][WIP] Common: Add an Initial Chat Memory Interface/Implementation #12698

Conversation

markhpc commented Apr 1, 2025

markhpc commented Apr 2, 2025

markhpc commented Apr 3, 2025

ggerganov commented Apr 3, 2025 • edited Loading

ngxson commented Apr 3, 2025 • edited Loading

markhpc commented Apr 3, 2025

ngxson commented Apr 3, 2025

ngxson commented Apr 3, 2025

markhpc commented Apr 3, 2025 • edited Loading

markhpc commented Apr 7, 2025 • edited Loading

markhpc commented Apr 19, 2025 • edited Loading

markhpc commented Apr 19, 2025

ggerganov commented Apr 3, 2025 •

edited

Loading

ngxson commented Apr 3, 2025 •

edited

Loading

markhpc commented Apr 3, 2025 •

edited

Loading

markhpc commented Apr 7, 2025 •

edited

Loading

markhpc commented Apr 19, 2025 •

edited

Loading